The biggest dilemma of a salesman is guessing the price range a customer has in mind for a particular product, since it is often considered rude to ask for a customer's budget directly. Hence, Analytics Educator is going to build a predictive model to predict the total amount that customers are willing to pay. In this case study we take car purchasing data, and our predictive model will help us understand the price range at which the customer is looking to buy a car. We have a dataset with the following variables: Customer Name, Customer e-mail, Country, Gender, Age, Annual Salary, Credit Card Debt, Net Worth, and Car Purchase Amount.
The model should predict: Car Purchase Amount, the total amount the customer is willing to pay for the car.
We will be using two of the most important and robust techniques of the modern data science industry to build the model. The two algorithms are Artificial Neural Networks and Extreme Gradient Boosting. Once done, we will compare the results to see which algorithm gives better accuracy.
A subset of machine learning methods called Artificial Neural Networks (ANN) is designed to mimic the structure and operation of the human brain. They are made to identify intricate patterns in data and generate predictions based on that analysis. Due to their performance in a variety of applications, including image recognition, natural language processing, and speech recognition, as well as their capacity to process massive volumes of data quickly and accurately, ANNs have become quite popular.
ANNs are made up mostly of layers of interconnected nodes, also referred to as artificial neurons. These neurons take in information, process it mathematically, and then send their results on to the neurons in the next layer. ANNs adjust their weights and biases through a technique known as backpropagation to enhance their performance on a specific task. The result is a powerful and adaptable machine learning system that can be trained to handle many challenging problems.
The performance of ANNs has significantly improved, and their range of potential applications has grown, in recent years as a result of advancements in processing power and data accessibility. As a result, ANNs are now an essential tool for scientists, engineers, and researchers working in a variety of fields.
Artificial Neural Networks (ANN) are made up of interconnected nodes that process input, output results, and are modelled after the structure and operation of the human brain. In order to perform better on a particular job, ANNs employ a learning method to modify the weights, or the strength of connections between nodes.
An artificial neuron, which receives input from other neurons and generates an output, is the fundamental component of an ANN. Each input is multiplied by a weight and the products are summed. A bias term is added to that sum, and the result is then run through an activation function to determine the neuron's output. This output is then passed on to the neurons in the subsequent layer, as sketched in the example below.
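As a minimal sketch of this computation (plain NumPy, with made-up weights and a ReLU activation chosen purely for illustration):
import numpy as np
def relu(x):
    return np.maximum(0, x)
# inputs coming from three neurons in the previous layer (made-up values)
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])  # strength of each incoming connection
bias = 0.2                            # bias term added to the weighted sum
# weighted sum of the inputs, plus the bias, passed through the activation
output = relu(np.dot(inputs, weights) + bias)
print(output)  # this value is sent on to the neurons in the next layer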
ANNs are organized into layers. The input layer receives the data and passes it to one or more hidden layers, and the output layer produces the network's final output. During training, the network is shown examples of input data together with their matching outputs, and its weights and biases are adjusted based on the discrepancy between the predicted output and the actual output. Backpropagation is the technique used to make these adjustments and boost the network's performance on the task.
Once trained, the ANN can be applied to new data to produce predictions. The input data is sent through the network, and the weights and biases that were discovered during training are used to determine the output. Numerous tasks, including audio and picture identification, natural language processing, and financial forecasting, have been successfully completed with ANNs.
Extreme Gradient Boosting, or XGBoost, is a potent and well-liked open-source machine learning framework used for supervised learning issues including regression and classification. It is a distributed gradient boosting library that has been optimised to be very effective, adaptable, and portable.
Each weak decision tree in the ensemble created by XGBoost is trained to fix the mistakes produced by the one before it. When training, XGBoost iteratively adds additional decision trees to the ensemble in order to optimise a loss function.
To determine the best split points for each tree, the algorithm works with the gradient of the loss function with respect to the current predictions. With this method, XGBoost can manage enormous datasets and deliver cutting-edge performance on a range of workloads.
Using XGBoost, users can adjust a wide range of hyperparameters, including the learning rate, the maximum depth of each tree, and the number of trees in the ensemble. It also supports different forms of regularisation to reduce overfitting and boost generalisation performance. In addition, to aid with hyperparameter tuning and avoid overfitting, XGBoost offers helpful features like built-in cross-validation and early stopping.
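To make these knobs concrete, here is a small sketch on toy, randomly generated data; the parameter values are illustrative assumptions, not tuned settings. It shows an XGBRegressor with explicit hyperparameters, L1/L2 regularisation, and early stopping:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
# toy regression data purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=500)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
model = xgb.XGBRegressor(
    n_estimators=500,    # number of trees in the ensemble
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    max_depth=4,         # maximum depth of each tree
    reg_alpha=0.1,       # L1 regularisation on leaf weights
    reg_lambda=1.0,      # L2 regularisation on leaf weights
)
# early stopping halts training once the validation score stops improving;
# note: newer XGBoost versions take early_stopping_rounds in the constructor instead of fit()
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], early_stopping_rounds=20, verbose=False)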
Numerous applications, such as online advertising, fraud detection, and natural language processing, have effectively exploited XGBoost. Its effectiveness, scalability, and versatility have led to its widespread adoption in both academia and industry.
Extreme Gradient Boosting, or XGBoost, is a supervised machine learning algorithm that boosts model accuracy using gradients. The algorithm builds an ensemble of weak decision trees, each of which is trained to fix the mistakes made by the one before it. The fundamental idea behind XGBoost is to reduce a loss function by expanding the ensemble with new decision trees that fit the residuals of the older ones. By repeatedly adding fresh trees to the ensemble during training, XGBoost optimises the loss function. Each tree is trained on a fraction of the data; the approach calculates the gradient of the loss function with respect to the current predictions and uses it to choose the optimum split points for each tree.
The algorithm first builds a single decision tree, a straightforward model that forecasts the target variable from the input features. The subsequent decision tree is then trained on the residuals (the discrepancy between the predicted and actual values) of the first model. The process of fitting a decision tree and using its residuals to train the following tree is repeated until the required number of trees is reached, as in the sketch below.
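To make the residual-fitting idea concrete, here is a minimal sketch of plain gradient boosting for squared-error loss (not XGBoost itself), using scikit-learn decision trees on made-up data:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# toy regression data purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
learning_rate = 0.1
n_trees = 50
prediction = np.full_like(y, y.mean())  # start from a constant prediction
trees = []
for _ in range(n_trees):
    residuals = y - prediction          # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)              # each new tree fits the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
# final prediction for new data = constant start + sum of scaled tree outputs
def boosted_predict(X_new):
    out = np.full(len(X_new), y.mean())
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out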
XGBoost supports a number of regularisation techniques, including L1 and L2 regularisation, which penalise large weights or restrict the complexity of each tree, to avoid overfitting. Additionally, to aid with hyperparameter tuning and avoid overfitting, XGBoost offers features like integrated cross-validation and early stopping.
Once the ensemble of trees has been trained, XGBoost combines the predictions by summing the output of every tree (together with a base score) to get the final output. For regression problems this summed value is used directly as the prediction. For classification problems, the summed scores are converted into class probabilities, using the logistic (sigmoid) function for binary targets or a softmax function for multi-class targets.
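A minimal sketch of that final conversion step (the scores below are made-up summed leaf outputs, purely for illustration):
import numpy as np
# binary case: a single summed score is passed through the logistic (sigmoid) function
score = 0.8
p_positive = 1.0 / (1.0 + np.exp(-score))
# multi-class case: softmax turns the summed scores for each class into probabilities
margins = np.array([1.2, -0.3, 0.4])
probs = np.exp(margins) / np.exp(margins).sum()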
Overall, XGBoost is a strong and adaptable machine learning algorithm that excels at a wide range of tasks and can handle enormous datasets. Because of its effectiveness, scalability, and versatility, it is frequently utilised in both academia and industry.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# set the working directory to the folder containing the CSV file (adjust the path for your machine)
os.chdir("C:\\Users\\ASUS\\Desktop")
# read the car purchasing dataset
car_df = pd.read_csv('Car_Purchasing_Data.csv', encoding='ISO-8859-1')
car_df.head()
| | Customer Name | Customer e-mail | Country | Gender | Age | Annual Salary | Credit Card Debt | Net Worth | Car Purchase Amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Martina Avila | cubilia.Curae.Phasellus@quisaccumsanconvallis.edu | Bulgaria | 0 | 41.851720 | 62812.09301 | 11609.380910 | 238961.2505 | 35321.45877 |
| 1 | Harlan Barnes | eu.dolor@diam.co.uk | Belize | 0 | 40.870623 | 66646.89292 | 9572.957136 | 530973.9078 | 45115.52566 |
| 2 | Naomi Rodriquez | vulputate.mauris.sagittis@ametconsectetueradip... | Algeria | 1 | 43.152897 | 53798.55112 | 11160.355060 | 638467.1773 | 42925.70921 |
| 3 | Jade Cunningham | malesuada@dignissim.com | Cook Islands | 1 | 58.271369 | 79370.03798 | 14426.164850 | 548599.0524 | 67422.36313 |
| 4 | Cedric Leach | felis.ullamcorper.viverra@egetmollislectus.net | Brazil | 1 | 57.313749 | 59729.15130 | 5358.712177 | 560304.0671 | 55915.46248 |
Here "Car Purchase Amount" is our dependent variable; we need to predict it based on other independent variables like Age, Annual Salary, Credit Card Debt etc.
car_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Customer Name        500 non-null    object
 1   Customer e-mail      500 non-null    object
 2   Country              500 non-null    object
 3   Gender               500 non-null    int64
 4   Age                  500 non-null    float64
 5   Annual Salary        500 non-null    float64
 6   Credit Card Debt     500 non-null    float64
 7   Net Worth            500 non-null    float64
 8   Car Purchase Amount  500 non-null    float64
dtypes: float64(5), int64(1), object(3)
memory usage: 35.3+ KB
We can see that the data has 500 rows and 9 columns in total. Customer Name, Customer e-mail and Country are character (object) variables. All character variables should be checked, since they may need to be converted into dummy variables.
n = car_df.nunique(axis=0)
n
Customer Name 498
Customer e-mail 500
Country 211
Gender 2
Age 500
Annual Salary 500
Credit Card Debt 500
Net Worth 500
Car Purchase Amount 500
dtype: int64
We can see that the object (character) variables Customer Name, Customer e-mail and Country have 498, 500, and 211 unique values respectively. Almost every value is unique and common values are rare. Hence, these variables are of no use to us: even if we created dummy variables from them, they would hardly have any impact on the dependent variable. We will drop them.
car_df = car_df.drop(['Customer Name', 'Customer e-mail', 'Country'], axis = 1)
car_df.head(2)
| | Gender | Age | Annual Salary | Credit Card Debt | Net Worth | Car Purchase Amount |
|---|---|---|---|---|---|---|
| 0 | 0 | 41.851720 | 62812.09301 | 11609.380910 | 238961.2505 | 35321.45877 |
| 1 | 0 | 40.870623 | 66646.89292 | 9572.957136 | 530973.9078 | 45115.52566 |
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# pairwise correlation of the numeric variables
corr = car_df.corr()
# keep correlations whose absolute value is at least `thresh` (0 keeps everything) and mask the diagonal
thresh = 0
kot = corr[((corr >= thresh) | (corr <= -thresh)) & (corr != 1)]
plt.figure(figsize=(10,3))
sns.heatmap(kot, cmap="Reds", annot=True)
[Heatmap of the pairwise correlations between the numeric variables]
# Separate the features from the label: X holds the independent variables
X = car_df.drop(['Car Purchase Amount'], axis=1)
# y holds the dependent variable (the label) we want to predict
y = car_df['Car Purchase Amount']
# Split the data into a 70:30 train:test ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Importing the Keras libraries and packages
import tensorflow as tf
### Initializing the ANN
ann = tf.keras.models.Sequential()
### Adding the input layer and the first hidden layer
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
### Adding the second hidden layer
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
### Adding the output layer (a single unit, since we predict one continuous value)
ann.add(tf.keras.layers.Dense(units=1))
### Compiling the ANN
ann.compile(optimizer='adam', loss='mean_squared_error')
### Training the ANN model on the Training set
# Keras expects NumPy arrays, so convert the pandas objects first
y_train = np.array(y_train)
X_train = np.array(X_train)
ann.fit(X_train, y_train, batch_size=25, epochs=100)
Epoch 1/100
14/14 [==============================] - 0s 846us/step - loss: 2254526208.0000
Epoch 2/100
14/14 [==============================] - 0s 769us/step - loss: 874682560.0000
Epoch 3/100
14/14 [==============================] - 0s 1ms/step - loss: 336855360.0000
Epoch 4/100
14/14 [==============================] - 0s 1ms/step - loss: 183744560.0000
Epoch 5/100
14/14 [==============================] - 0s 769us/step - loss: 132019112.0000
Epoch 6/100
14/14 [==============================] - 0s 923us/step - loss: 112405392.0000
Epoch 7/100
14/14 [==============================] - 0s 769us/step - loss: 94877496.0000
Epoch 8/100
14/14 [==============================] - 0s 846us/step - loss: 80443808.0000
Epoch 9/100
14/14 [==============================] - 0s 846us/step - loss: 69199248.0000
Epoch 10/100
14/14 [==============================] - 0s 923us/step - loss: 61142228.0000
Epoch 11/100
14/14 [==============================] - 0s 923us/step - loss: 56840748.0000
Epoch 12/100
14/14 [==============================] - 0s 769us/step - loss: 55082956.0000
Epoch 13/100
14/14 [==============================] - 0s 846us/step - loss: 54673260.0000
Epoch 14/100
14/14 [==============================] - 0s 769us/step - loss: 54076820.0000
Epoch 15/100
14/14 [==============================] - 0s 923us/step - loss: 53065920.0000
Epoch 16/100
14/14 [==============================] - 0s 846us/step - loss: 51914512.0000
Epoch 17/100
14/14 [==============================] - 0s 923us/step - loss: 51449984.0000
Epoch 18/100
14/14 [==============================] - 0s 846us/step - loss: 51118772.0000
Epoch 19/100
14/14 [==============================] - 0s 923us/step - loss: 50901304.0000
Epoch 20/100
14/14 [==============================] - 0s 692us/step - loss: 51026604.0000
Epoch 21/100
14/14 [==============================] - 0s 923us/step - loss: 50234812.0000
Epoch 22/100
14/14 [==============================] - 0s 923us/step - loss: 49995952.0000
Epoch 23/100
14/14 [==============================] - 0s 846us/step - loss: 49874324.0000
Epoch 24/100
14/14 [==============================] - 0s 846us/step - loss: 49665704.0000
Epoch 25/100
14/14 [==============================] - 0s 769us/step - loss: 49508592.0000
Epoch 26/100
14/14 [==============================] - 0s 769us/step - loss: 49688080.0000
Epoch 27/100
14/14 [==============================] - 0s 923us/step - loss: 49259988.0000
Epoch 28/100
14/14 [==============================] - 0s 923us/step - loss: 48994604.0000
Epoch 29/100
14/14 [==============================] - 0s 923us/step - loss: 49293124.0000
Epoch 30/100
14/14 [==============================] - 0s 923us/step - loss: 48767880.0000
Epoch 31/100
14/14 [==============================] - 0s 769us/step - loss: 48721324.0000
Epoch 32/100
14/14 [==============================] - 0s 923us/step - loss: 48790360.0000
Epoch 33/100
14/14 [==============================] - 0s 769us/step - loss: 48626628.0000
Epoch 34/100
14/14 [==============================] - 0s 846us/step - loss: 48443716.0000
Epoch 35/100
14/14 [==============================] - 0s 846us/step - loss: 48344956.0000
Epoch 36/100
14/14 [==============================] - 0s 846us/step - loss: 48129692.0000
Epoch 37/100
14/14 [==============================] - 0s 846us/step - loss: 48067148.0000
Epoch 38/100
14/14 [==============================] - 0s 769us/step - loss: 47960180.0000
Epoch 39/100
14/14 [==============================] - 0s 769us/step - loss: 48106928.0000
Epoch 40/100
14/14 [==============================] - 0s 3ms/step - loss: 48997680.0000
Epoch 41/100
14/14 [==============================] - 0s 923us/step - loss: 47792044.0000
Epoch 42/100
14/14 [==============================] - 0s 769us/step - loss: 47327528.0000
Epoch 43/100
14/14 [==============================] - 0s 769us/step - loss: 47432068.0000
Epoch 44/100
14/14 [==============================] - 0s 846us/step - loss: 47894848.0000
Epoch 45/100
14/14 [==============================] - 0s 923us/step - loss: 47180852.0000
Epoch 46/100
14/14 [==============================] - 0s 846us/step - loss: 47228808.0000
Epoch 47/100
14/14 [==============================] - 0s 769us/step - loss: 47633572.0000
Epoch 48/100
14/14 [==============================] - 0s 769us/step - loss: 48302672.0000
Epoch 49/100
14/14 [==============================] - 0s 923us/step - loss: 47130052.0000
Epoch 50/100
14/14 [==============================] - 0s 923us/step - loss: 47047244.0000
Epoch 51/100
14/14 [==============================] - 0s 846us/step - loss: 47380172.0000
Epoch 52/100
14/14 [==============================] - 0s 923us/step - loss: 47480752.0000
Epoch 53/100
14/14 [==============================] - 0s 923us/step - loss: 47162876.0000
Epoch 54/100
14/14 [==============================] - 0s 769us/step - loss: 47209596.0000
Epoch 55/100
14/14 [==============================] - 0s 692us/step - loss: 47092436.0000
Epoch 56/100
14/14 [==============================] - 0s 769us/step - loss: 46708604.0000
Epoch 57/100
14/14 [==============================] - 0s 846us/step - loss: 47128296.0000
Epoch 58/100
14/14 [==============================] - 0s 923us/step - loss: 47467568.0000
Epoch 59/100
14/14 [==============================] - 0s 769us/step - loss: 47212180.0000
Epoch 60/100
14/14 [==============================] - 0s 692us/step - loss: 46686548.0000
Epoch 61/100
14/14 [==============================] - 0s 769us/step - loss: 46814136.0000
Epoch 62/100
14/14 [==============================] - 0s 923us/step - loss: 46909644.0000
Epoch 63/100
14/14 [==============================] - 0s 846us/step - loss: 46811908.0000
Epoch 64/100
14/14 [==============================] - 0s 846us/step - loss: 46746632.0000
Epoch 65/100
14/14 [==============================] - 0s 769us/step - loss: 46422720.0000
Epoch 66/100
14/14 [==============================] - 0s 769us/step - loss: 46900496.0000
Epoch 67/100
14/14 [==============================] - 0s 769us/step - loss: 46667608.0000
Epoch 68/100
14/14 [==============================] - 0s 846us/step - loss: 46425048.0000
Epoch 69/100
14/14 [==============================] - 0s 769us/step - loss: 46460028.0000
Epoch 70/100
14/14 [==============================] - 0s 769us/step - loss: 46731956.0000
Epoch 71/100
14/14 [==============================] - 0s 846us/step - loss: 46738132.0000
Epoch 72/100
14/14 [==============================] - 0s 846us/step - loss: 46763880.0000
Epoch 73/100
14/14 [==============================] - 0s 846us/step - loss: 46602308.0000
Epoch 74/100
14/14 [==============================] - 0s 769us/step - loss: 46635252.0000
Epoch 75/100
14/14 [==============================] - 0s 769us/step - loss: 46278932.0000
Epoch 76/100
14/14 [==============================] - 0s 846us/step - loss: 47537420.0000
Epoch 77/100
14/14 [==============================] - 0s 923us/step - loss: 47229988.0000
Epoch 78/100
14/14 [==============================] - 0s 846us/step - loss: 47287080.0000
Epoch 79/100
14/14 [==============================] - 0s 846us/step - loss: 46228960.0000
Epoch 80/100
14/14 [==============================] - 0s 846us/step - loss: 46520068.0000
Epoch 81/100
14/14 [==============================] - 0s 846us/step - loss: 46385520.0000
Epoch 82/100
14/14 [==============================] - 0s 846us/step - loss: 46577500.0000
Epoch 83/100
14/14 [==============================] - 0s 923us/step - loss: 46334468.0000
Epoch 84/100
14/14 [==============================] - 0s 923us/step - loss: 46291264.0000
Epoch 85/100
14/14 [==============================] - 0s 846us/step - loss: 45952480.0000
Epoch 86/100
14/14 [==============================] - 0s 769us/step - loss: 46196100.0000
Epoch 87/100
14/14 [==============================] - 0s 846us/step - loss: 46544616.0000
Epoch 88/100
14/14 [==============================] - 0s 846us/step - loss: 45891492.0000
Epoch 89/100
14/14 [==============================] - 0s 923us/step - loss: 46436244.0000
Epoch 90/100
14/14 [==============================] - 0s 846us/step - loss: 46673612.0000
Epoch 91/100
14/14 [==============================] - 0s 769us/step - loss: 46148052.0000
Epoch 92/100
14/14 [==============================] - 0s 769us/step - loss: 46131000.0000
Epoch 93/100
14/14 [==============================] - 0s 846us/step - loss: 46123468.0000
Epoch 94/100
14/14 [==============================] - 0s 923us/step - loss: 46411020.0000
Epoch 95/100
14/14 [==============================] - 0s 846us/step - loss: 46318272.0000
Epoch 96/100
14/14 [==============================] - 0s 769us/step - loss: 46558356.0000
Epoch 97/100
14/14 [==============================] - 0s 692us/step - loss: 46144964.0000
Epoch 98/100
14/14 [==============================] - 0s 769us/step - loss: 45992452.0000
Epoch 99/100
14/14 [==============================] - 0s 769us/step - loss: 46097556.0000
Epoch 100/100
14/14 [==============================] - 0s 692us/step - loss: 46090004.0000
<keras.callbacks.History at 0x248029e8>
# Generate predictions on the test set
y_pred = ann.predict(X_test)
y_test = y_test.tolist()
d = pd.DataFrame()
d["y_test"] = y_test
d["y_pred"] = y_pred.flatten()  # flatten the (n, 1) Keras output into a 1-D column
# Mean Absolute Percentage Error (MAPE)
d["mp"] = (abs(d["y_test"] - d["y_pred"])) / d["y_test"]
(d.mp.mean()) * 100
13.022884808054105
# Importing the XGBoost library
import xgboost as xg
# Instantiation ('reg:linear' is deprecated in newer XGBoost versions in favour of
# 'reg:squarederror', which is why the warning below appears; both fit squared-error regression)
xgb_r = xg.XGBRegressor(objective='reg:linear', n_estimators=100, seed=123)
# Fitting the model
xgb_r.fit(X_train, y_train)
[09:36:09] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, objective='reg:linear', predictor='auto',
random_state=123, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=123, subsample=1, tree_method='exact', validate_parameters=1,
verbosity=None)
y_pred = xgb_r.predict(X_test)
d = pd.DataFrame()
d["y_test"] = y_test
d["y_pred"] = y_pred
# Mean Absolute Percentage Error (MAPE)
d["mp"] = abs((d["y_test"]- d["y_pred"])/d["y_test"])
(d.mp.mean())*100
3.632458679991241
With a MAPE of about 3.6% against roughly 13% for the ANN, XGBoost clearly gives the better accuracy on this dataset. Readers of this blog are welcome to mail us their suggestions on how to further improve the model; you will find our contact details here.